Augmented reality object manipulation

ABSTRACT

A processing system having at least one processor may detect a first object in a first video of a first user and detect a second object in a second video of a second user, where the first video and the second video are part of a visual communication session between the first user and the second user. The processing system may further detect a first action in the first video relative to the first object, detect a second action in the second video relative to the second object, detect a difference between the first action and the second action, and provide a notification indicative of the difference.

This application is a continuation of U.S. Patent Application Serial No. 17/403,611, filed on Aug. 16, 2021, now U.S. Pat. No. 11,567,572, which is herein incorporated by reference in its entirety.

The present disclosure relates generally to visual communication sessions, and more particularly to methods, computer-readable media, and devices for providing a notification indicative of a detected difference in actions relating to a first object in a first video and a second object in a second video of a visual communication session.

BRIEF DESCRIPTION OF THE DRAWINGS

The teachings of the present disclosure can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:

FIG. 1 illustrates an example network related to the present disclosure;

FIG. 2 illustrates an example of a visual communication session between a first user and a second user having respective wearable augmented reality devices;

FIG. 3 illustrates a flowchart of an example method for providing a notification indicative of a detected difference in actions relating to a first object in a first video and a second object in a second video of a visual communication session; and

FIG. 4 illustrates a high-level block diagram of a computing device specifically programmed to perform the steps, functions, blocks, and/or operations described herein.

To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.

DETAILED DESCRIPTION

In one example, the present disclosure describes a method, computer-readable medium, and device for providing a notification indicative of a detected difference in actions relating to a first object in a first video and a second object in a second video of a visual communication session. For instance, in one example, a processing system having at least one processor may detect a first object in a first video of a first user and detect a second object in a second video of a second user, where the first video and the second video are part of a visual communication session between the first user and the second user. The processing system may further detect a first action in the first video relative to the first object, detect a second action in the second video relative to the second object, detect a difference between the first action and the second action, and provide a notification indicative of the difference.

Examples of the present disclosure provide a network-based system that facilitates proper physical object manipulation via a visual communication session (e.g., mixed reality (MR), augmented reality (AR), virtual reality (VR), or other video-based communication sessions) between at least two users. For instance, in one example, the present disclosure may facilitate remote assistance where one party is engaged to help another fix an issue relating to an object/item. For example, a visual communication session may be established between endpoint devices of two users. In one example, a processing system of the present disclosure may detect and/or map respective environments and objects therein via computer vision processing techniques, 3D sensor readings, or other local sensor techniques (e.g., infrared (IR) light field detection, etc.). For instance, the system may determine specific objects and room topology for segmentation. In one example, the present disclosure may also determine users' intentions, on both sides of the visual communication session, so that the system can isolate which objects are being interacted with. In one example, the present disclosure may also mask out and transmit relevant aspects of the environment and object(s). In particular, transmission of video, audio, and virtual objects may be constrained by a combination of objects, environment/room features, and user intention. In addition, in one example, the present disclosure may also align to specific objects on either side of the transmission. For example, the present disclosure may align an expert's manipulation of a router to a router on a client's side.

To illustrate, an expert may be communicating with a remote user via a visual communication session to repair equipment at the user's house. If the user does not know how to repair the equipment, the expert may instruct the user verbally. As in the past, the expert may provide a video, animation, or the like, or may draw additional visuals to show the user. In addition, the expert may provide a demonstration via video for the user to see and follow. However, in accordance with the present disclosure, a user's virtual environment may be enhanced, e.g., with respect to the actual item/object in front of the user, to demonstrate a proper repair. It should be noted that although the term "expert" is used throughout the present disclosure, as referred to herein, the term is not intended to convey a guarantee of a particular level of expertise, but rather is indicative of any user who may be assisting or instructing another user, e.g., even if the advice or assistance is wrong, even if other "experts" may disagree with the recommended or demonstrated actions, etc.

In an illustrative example, an expert (broadly an instructor) and another user may start a new visual communication session. In one example, the establishment of the visual communication session may be triggered by passive monitoring or specific mentioning of a task, in response to which the user's device may connect the user's device and a device of the expert via a network-based processing system of the present disclosure. The processing system may therefore create the session and begin to detect objects and aspects of the environments for both sides. Each side may include cameras, microphones, positional sensors, and other sensors, AR devices, and so forth, from which the processing system may collect various data. In one example, the processing system may also be provided with or have access to known topologies (e.g., floor plan, etc.). Accordingly, the processing system may generate maps of the three-dimensional (3D) environments of both sides.

The processing system may additionally determine objects of interest for the visual communication session. For example, if the task is known and relates to fixing a set-top box (STB) or television problem, only the object models relevant to the task may be loaded for detection. As such, visual data feeds from both sides may be evaluated against specific object models to detect the presence of, and to identify, those specific objects/items. It should be noted that other tasks may have other object models related thereto, which may be loaded/activated when a visual communication session for such tasks takes place. In one example, object detection and recognition may be enhanced by reference to record(s) of the user's current install and equipment, e.g., as recorded in a telecommunication service provider user database.

In addition, the processing system may determine intentions of actors on both sides. For example, the processing system may observe movements of both the expert and the user, and may utilize other engagement metrics (e.g., gaze direction, touch sensing, vocal utterances, and so on). Alternatively, or in addition, the processing system may determine if the users are moving objects, or moving two or more objects near each other (as defined/needed by the task), such as bringing a cable closer to a television or STB. Accordingly, the processing system may resolve specific objects of interest by a combination of (1) object detection and/or identification and (2) user intention. In one example, the processing system may edit video/visual feeds to transmit objects of interest and to mask/hide irrelevant objects or other aspects of the environment, e.g., unless the users determine more background is helpful or required.
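
The disclosure does not prescribe a particular scoring rule for combining (1) object detection/identification with (2) user intention, but the idea can be illustrated with a minimal sketch. All names (DetectedObject, IntentSignals, the weights and threshold) are hypothetical illustrations, not part of the disclosure.

```python
from dataclasses import dataclass

@dataclass
class DetectedObject:
    label: str            # e.g., "router", "yellow_cable"
    confidence: float     # detector confidence, 0..1
    task_relevant: bool   # label appears in the task's object manifest

@dataclass
class IntentSignals:
    gaze_overlap: float            # fraction of recent frames where gaze rests on the object
    touched: bool                  # touch/force-glove contact detected
    mentioned: bool                # label heard in recent speech
    moved_toward_peer_object: bool # e.g., cable brought near the television or STB

def interest_score(obj: DetectedObject, intent: IntentSignals) -> float:
    """Combine object detection/identification with user intention into one score."""
    score = obj.confidence
    if obj.task_relevant:
        score += 0.5
    score += 0.3 * intent.gaze_overlap
    if intent.touched:
        score += 0.4
    if intent.mentioned:
        score += 0.3
    if intent.moved_toward_peer_object:
        score += 0.3
    return score

def resolve_objects_of_interest(detections, intents, threshold=1.0):
    """Return labels to keep in the transmitted feed; everything else may be masked/hidden."""
    return [o.label for o, i in zip(detections, intents)
            if interest_score(o, i) >= threshold]
```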

In accordance with the present disclosure, both parties may be equipped with force feedback gloves (e.g., haptic gloves) to render co-present touch of objects on both sides. For example, if the user is engaged with an item (e.g., a screwdriver or a cable) and using a force feedback glove (or gloves), the processing system may allow the user to relinquish control and enable the expert to assume remote control of the user's force feedback glove(s) via the expert's force feedback glove(s). For instance, the expert and the user may both engage one or more objects with the respective gloves and hold the object(s) in relatively the same positions, at which point a command or commands from one or both of the parties may activate remote control by the expert. The expert's force feedback glove(s) can then send signals for manipulating an object that trigger corresponding actions on the user's force feedback glove(s). For example, the expert may manipulate the object(s) on the expert's side, and the user's force feedback glove(s) may be engaged via control signals to perform corresponding manipulations of the same/similar object(s) on the user's side.

In one example, the expert's force feedback glove(s) may translate the expert's motions so that the remote user will feel the directionality of the expert's instructions and/or gestures, and potentially take advantage of the expert's advanced manual dexterity. In addition, in one example, the expert may define maximum force boundaries in specific tasks by setting example "pull this hard" limits, and the customer can feel resistance based on that transmitted intensity. In one example, force feedback may be tuned or overridden according to user comfort or compliance in the interaction. Upon completion, thresholds for interaction, alignment, etc. may be saved and used for a future review of the expert's task execution, for automated learning of the task for automated expert avatars, virtual assistants, or the like, and so on.
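
A minimal sketch of relaying one expert glove reading to the user's glove actuators, honoring the expert's "pull this hard" limit and an optional per-user comfort override, is shown below. The field names, units, and clamping rule are assumptions for illustration only.

```python
def relay_glove_frame(expert_reading: dict,
                      expert_max_n: float,
                      user_comfort_max_n=None) -> dict:
    """Translate an expert force-feedback-glove reading into an actuator command
    for the user's glove. expert_max_n is the instructor's "pull this hard" limit;
    user_comfort_max_n optionally lowers it further for comfort/compliance."""
    limit = expert_max_n if user_comfort_max_n is None else min(expert_max_n, user_comfort_max_n)
    command = {"finger_forces_n": {}}
    for finger, force_n in expert_reading["finger_forces_n"].items():
        # clamp each per-finger force into [0, limit]
        command["finger_forces_n"][finger] = max(0.0, min(force_n, limit))
    # orientation is passed through so the user feels the directionality of the gesture
    command["wrist_rotation_deg"] = expert_reading["wrist_rotation_deg"]
    return command
```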

In one example, newly relevant objects on the user side may become visible to the expert, such as safety hazards. For instance, a stack of papers may be irrelevant to proper task performance and may initially be excluded from the video/visual feed. However, if the stack of papers is near a newly plugged-in piece of electronic equipment, such as an STB, the papers may now be a fire hazard and may be brought back into the video/visual feed from the user's side that is transmitted to the expert. Similarly, tools brought into view but not acknowledged by the user may nevertheless be included in the video/visual feed of the user that is transmitted to the expert, e.g., based on knowledge of the particular task. For instance, if the task is known, then the tools that are customarily used for the task may also be known. Thus, even if the tool is not acknowledged by the user, it may nevertheless be considered relevant and included in the view of the user that is provided to the expert.

In one example, the user and/or expert may "swipe out" objects or gesture to blur out objects from the interaction. In addition, the processing system may maintain a record of these selections for a next iteration/interaction to exclude similar categories or specific objects (e.g., valuables or dirty laundry in a closet). Likewise, in one example, the processing system may redact or remove objects, or may add virtual objects, based upon users' security levels, priority, etc. Similarly, the user, the expert, or both may override modifications to the videos/visual feeds for safety concerns (e.g., boiling water that is dangerous and needs attention). In one example, the present disclosure may also cause augmented reality (AR) displays to re-light aspects of an environment and/or to activate ambient lighting to relight and target specific objects in a room or other environments (e.g., projectors for visuals, and similarly audio for interactions).

In one example, the present disclosure may incorporate a machine learning (ML)-based virtual assistant to first attempt diagnostics or triage of the problem, thus freeing experts for more intricate/novel issues. In one example, visual communication sessions for a task/problem may be recorded by the processing system and organically added as potential solutions (e.g., at least from the expert side). For example, as more instances of one or more experts repairing a device are recorded, the tools used to repair the device may be added to the object manifest for faster detection and alternate solution discovery (e.g., "I saw a previous customer use tool X to solve this shortcoming"). In one example, a virtual assistant (VA)/voice assistant may supplement an expert in assisting a user in correctly solving a problem by providing suggested next steps or other hints based upon knowledge of the task (e.g., "you may be too far from the model," "you'll need this tool next so keep it close," etc.). Similarly, the processing system may learn automated steps and correctly branch to those automated steps to match the best conditions for the benefit of both the expert and the user (e.g., to enable the expert to not have to look up or remember a next step for a complex task, a task that is performed infrequently, etc.). In addition, examples of the present disclosure may provide improved disambiguation for complex tasks. For instance, if there are multiple "yellow cables," the advanced color analysis of the processing system may provide a user with a better indication of which one to use (e.g., by matching to the most correct shade of yellow).
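
The "most correct shade of yellow" matching can be sketched as a nearest-color lookup. The function name, the use of plain RGB distance (a perceptual space such as CIE Lab would be a better fit), and the sample values are assumptions for illustration, not taken from the disclosure.

```python
def closest_cable(observed_rgb, candidate_cables):
    """Pick the candidate cable whose reference shade best matches the observed pixels.

    observed_rgb: (r, g, b) averaged over the detected cable region.
    candidate_cables: mapping of cable id -> reference (r, g, b) shade.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(candidate_cables, key=lambda cid: dist(observed_rgb, candidate_cables[cid]))

# e.g., disambiguating two "yellow cables"
cables = {"cable_1": (250, 215, 40), "cable_2": (230, 180, 20)}
print(closest_cable((245, 210, 45), cables))  # -> "cable_1"
```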

Thus, examples of the present disclosure give a highly precise alignment of environments and objects to enable remote control (e.g., if both sides are manipulating the same object, the processing system can map hand placement). In addition, the force feedback provided thereby may enable potential muscle memory/training and may be used for repair or assembly tasks, remote coaching (e.g., music playing or learning sports), and so on. In addition, examples of the present disclosure enable a remote expert to interact with a user in the user's environment while maintaining user privacy. An expert is able to "remote control" a fix by touching an example object on the expert's side, and using system alignment on the user side. In one example, the processing system may replicate the expert's manipulation of the object on the customer's force feedback glove(s). Alternatively, or in addition, the target object may be made to blink, highlighted, or otherwise visually enhanced to draw the user's attention. By synchronizing interactions between both sides (instead of just reading off a workflow), the expert is freed to focus more heavily on interaction with the user. These and other aspects of the present disclosure are described in greater detail below in connection with the examples of FIGS. 1-4.

To further aid in understanding the present disclosure, FIG. 1 illustrates an example system 100 in which examples of the present disclosure may operate. The system 100 may include any one or more types of communication networks, such as a traditional circuit switched network (e.g., a public switched telephone network (PSTN)) or a packet network such as an Internet Protocol (IP) network (e.g., an IP Multimedia Subsystem (IMS) network), an asynchronous transfer mode (ATM) network, a wireless network, a cellular network (e.g., in accordance with 3G, 4G/long term evolution (LTE), 5G, etc.), and the like, related to the current disclosure. It should be noted that an IP network is broadly defined as a network that uses Internet Protocol to exchange data packets. Additional example IP networks include Voice over IP (VoIP) networks, Service over IP (SoIP) networks, and the like.

In one example, the system 100 may comprise a network 102, e.g., a telecommunication service provider network, a core network, an enterprise network comprising infrastructure for computing and communications services of a business, an educational institution, a governmental service, or other enterprises. The network 102 may be in communication with one or more access networks 120 and 122, and the Internet (not shown). In one example, network 102 may combine core network components of a cellular network with components of a triple play service network, where triple-play services include telephone services, Internet services, and television services to subscribers. For example, network 102 may functionally comprise a fixed mobile convergence (FMC) network, e.g., an IP Multimedia Subsystem (IMS) network. In addition, network 102 may functionally comprise a telephony network, e.g., an Internet Protocol/Multi-Protocol Label Switching (IP/MPLS) backbone network utilizing Session Initiation Protocol (SIP) for circuit-switched and Voice over Internet Protocol (VoIP) telephony services. Network 102 may further comprise a broadcast television network, e.g., a traditional cable provider network or an Internet Protocol Television (IPTV) network, as well as an Internet Service Provider (ISP) network. In one example, network 102 may include a plurality of television (TV) servers (e.g., a broadcast server, a cable head-end), a plurality of content servers, an advertising server (AS), an interactive TV/video on demand (VoD) server, and so forth.

In accordance with the present disclosure, application server (AS) 104 may comprise a computing system or server, such as computing system 400 depicted in FIG. 4, and may be configured to provide one or more operations or functions for providing a notification indicative of a detected difference in actions relating to a first object in a first video and a second object in a second video of a visual communication session, as described herein. It should be noted that as used herein, the terms "configure" and "reconfigure" may refer to programming or loading a processing system with computer-readable/computer-executable instructions, code, and/or programs, e.g., in a distributed or non-distributed memory, which when executed by a processor, or processors, of the processing system within a same device or within distributed devices, may cause the processing system to perform various functions. Such terms may also encompass providing variables, data values, tables, objects, or other data structures or the like which may cause a processing system executing computer-readable instructions, code, and/or programs to function differently depending upon the values of the variables or other data structures that are provided. As referred to herein, a "processing system" may comprise a computing device including one or more processors, or cores (e.g., as illustrated in FIG. 4 and discussed below), or multiple computing devices collectively configured to perform various steps, functions, and/or operations in accordance with the present disclosure.

Thus, although only a single application server (AS) 104 is illustrated, it should be noted that any number of servers may be deployed, and which may operate in a distributed and/or coordinated manner as a processing system to perform operations for detecting and modifying actions of visual representations of users in visual content, in accordance with the present disclosure. In one example, AS 104 may comprise a physical storage device (e.g., a database server) to store various types of information in support of systems for providing a notification indicative of a detected difference in actions relating to a first object in a first video and a second object in a second video of a visual communication session, in accordance with the present disclosure. For example, AS 104 may store object recognition models, user data (including user device data), task templates/maps (e.g., comprising sequences of steps/actions to perform to complete an overall task), and so forth that may be processed by AS 104 in connection with establishing visual communication sessions, or that may be provided to devices establishing visual communication sessions via AS 104. AS 104 may further create and/or store action detection models that may be utilized in connection with the present examples. For ease of illustration, various additional elements of network 102 are omitted from FIG. 1.

In one example, the access networks 120 and 122 may comprise Digital Subscriber Line (DSL) networks, public switched telephone network (PSTN) access networks, broadband cable access networks, Local Area Networks (LANs), wireless access networks (e.g., an IEEE 802.11/Wi-Fi network and the like), cellular access networks, 3rd party networks, and the like. For example, the operator of network 102 may provide a cable television service, an IPTV service, or any other types of telecommunication service to subscribers via access networks 120 and 122. In one example, the access networks 120 and 122 may comprise different types of access networks, may comprise the same type of access network, or some access networks may be the same type of access network and others may be different types of access networks. In one example, the network 102 may be operated by a telecommunication network service provider. The network 102 and the access networks 120 and 122 may be operated by different service providers, the same service provider, or a combination thereof, or may be operated by entities having core businesses that are not related to telecommunications services, e.g., corporate, governmental, or educational institution LANs, and the like.

In one example, the access network 120 may be in communication with a device 131. Similarly, access network 122 may be in communication with one or more devices, e.g., device 141. Access networks 120 and 122 may transmit and receive communications between devices 131 and 141, between devices 131 and 141 and application server (AS) 104, other components of network 102, devices reachable via the Internet in general, and so forth. In one example, each of devices 131 and 141 may comprise any single device or combination of devices that may comprise a user endpoint device. For example, the devices 131 and 141 may each comprise a mobile device, a cellular smart phone, a wearable computing device (e.g., smart glasses, goggles, or headsets), a laptop, a tablet computer, a desktop computer, an application server, a bank or cluster of such devices, and the like. In one example, devices 131 and 141 may each comprise programs, logic, or instructions for providing a notification indicative of a detected difference in actions relating to a first object in a first video and a second object in a second video of a visual communication session. For example, devices 131 and 141 may each comprise a computing system or device, such as computing system 400 depicted in FIG. 4, and may be configured to provide one or more operations or functions in connection with examples of the present disclosure for providing a notification indicative of a detected difference in actions relating to a first object in a first video and a second object in a second video of a visual communication session, as described herein.

In one example, the device 131 is associated with a first user (user 1) 191 at a first physical environment 130. As illustrated in FIG. 1, the device 131 may comprise a wearable computing device (e.g., smart glasses) and may provide a user interface 135 for user 191. For instance, device 131 may comprise smart glasses with augmented reality (AR) enhancement capabilities. For example, endpoint device 131 may have a screen and a reflector to project outlining, highlighting, or other visual markers to the eye(s) of user 191 to be perceived in conjunction with the surroundings. In the present example, device 131 may present a window 137 within or via the user interface 135. Also associated with user 191 and/or first physical environment 130 is a camera 132, which may be facing user 191 and which may capture a video comprising the first physical environment 130, including user 191 and other items or objects therein, such as sticks A and B. In one example, camera 132 may communicate with device 131 wirelessly, e.g., to provide a video stream of the first physical environment 130. As an alternative, or in addition, in one example, device 131 may also comprise an outward-facing camera to capture video of the first physical environment 130 from a field of view in a direction that user 191 is looking.

In one example, the device 131 may present visual content of one or more other users via user interface 135 (e.g., presented as window 137 in FIG. 1). In one example, the physical environment 130 and user interface 135 may comprise an augmented reality (AR) or a mixed reality (MR) environment, e.g., when the physical environment 130 remains visible to user 191 when using device 131, and the visual content received from one or more other users is presented spatially in an intelligent manner with respect to the physical environment 130. In one example, the components associated with user 191 and/or first physical environment 130 that are used to establish and support a visual communication session may be referred to as a "communication system." For instance, a communication system may comprise device 131, or device 131 in conjunction with camera 132, device 131 in conjunction with wearable force feedback gloves 133 and 134, device 131 in conjunction with a smartphone or personal computer, a wireless router, or the like supporting visual communication sessions of device 131, and so on.

Similarly, device 141 may be associated with a second user (user 2) 192 at a second physical environment 140. As illustrated in FIG. 1, the device 141 may comprise a wearable computing device that is the same as or similar to device 131. However, in another example, device 141 may comprise a personal computer, desktop computer, or the like. As further illustrated in FIG. 1, device 141 may provide a user interface 145 for user 192 that displays at least one window 147. Also associated with user 192 and/or second physical environment 140 is a camera 142, which may be facing user 192 and which may capture a video comprising the second physical environment 140, including user 192 and other items or objects therein. In one example, camera 142 may be coupled to device 141 and may provide a video stream of the second physical environment 140.

As illustrated in FIG. 1, user 192 may also have force feedback gloves 143 and 144, which may measure, record, and/or transmit data related to movement and position, such as locations, orientations, accelerations, and so forth. For instance, force feedback gloves 143 and 144 may each include a Global Positioning System (GPS) unit, a gyroscope, a compass, one or more accelerometers, and so forth. Force feedback gloves 143 and 144 may also include a plurality of actuators which may be controllable (e.g., remotely) to provide positive force and/or movement of various portions of the hands of user 192 (e.g., electromechanical actuators or motors, electro-hydraulic actuators, electropneumatic actuators, etc.). In one example, force feedback gloves 143 and 144 may include transceivers for wireless communications, e.g., for Institute of Electrical and Electronics Engineers (IEEE) 802.11 based communications (e.g., "Wi-Fi"), IEEE 802.15 based communications (e.g., "Bluetooth," "ZigBee," etc.), cellular communications (e.g., 3G, 4G/LTE, 5G, etc.), and so forth. User 191 may be similarly equipped with force feedback gloves 133 and 134, which may be the same as or similar to force feedback gloves 143 and 144. As such, force feedback gloves 143 and 144 may provide various measurements to device 141 and/or to AS 104 (e.g., via device 141 and/or via access network 122) and may similarly receive control signals from force feedback gloves 133 and 134, device 141, AS 104, etc. Similarly, force feedback gloves 133 and 134 may provide various measurements to device 131 and/or to AS 104 (e.g., via device 131 and/or via access network 120) and may similarly receive control signals from force feedback gloves 143 and 144, device 131, AS 104, etc.

In one example, devices 131 and 141 may communicate with each other and/or with AS 104 to establish, maintain/operate, and/or tear down a visual communication session. In one example, AS 104 and device 131 and/or device 141 may operate in a distributed and/or coordinated manner to perform various steps, functions, and/or operations described herein. To illustrate, AS 104 may establish and maintain visual communication sessions for various users and may store and implement one or more configuration settings specifying both inbound and outbound modifications of visual content from the various users. The visual content may comprise video content, which may include visual imagery of a physical environment (e.g., including imagery of one or more users), and which in some cases may further include recorded audio of the physical environment.

As used herein, the terms augmented reality (AR) environment and virtual environment may be used to refer to the entire environment experienced by a user, including real-world images and sounds combined with images and sounds of the AR environment/virtual environment. The images and/or sounds (or portions thereof) of an AR environment may be referred to as "virtual objects" and may be presented to users via devices and systems of the present disclosure. While the real world may include other machine-generated images and sounds, e.g., animated billboards, music played over loudspeakers, and so forth, these images and sounds are considered part of the "real world," in addition to natural sounds and sights such as other physically present humans and the sounds they make, the sound of wind through buildings, trees, etc., the sight and movement of clouds, haze, precipitation, sunlight and its reflections on surfaces, and so on.

In one example, AS 104 may receive a request to establish a visual communication session from device 131 and/or device 141. The visual communication session may be established for such devices after AS 104 retrieves one or more configuration settings for the user 191 and/or user 192, determines which configuration setting(s), if any, to apply based upon the context(s), and activates the respective object detection models, action detection models, and/or configuration setting(s) which are determined to apply to the context(s). The request may be received via access network 120, access network 122, network 102, and/or the Internet in general, and the visual communication session may be provided via any one or more of the same networks.

The establishment of the visual communication session may include providing security keys, tokens, certificates, or the like to encrypt and to protect the media streams between devices 131 and 141 when in transit via one or more networks, and to allow devices 131 and 141 to decrypt and present received video content and/or received user interface content via user interfaces 135 and 145, respectively. In one example, the establishment of the visual communication session may further include reserving network resources of one or more networks (e.g., network 102, access networks 120 and 122, etc.) to support a particular quality of service (QoS) for the visual communication session (e.g., a certain video resolution, a certain delay measure, and/or a certain packet loss ratio, and so forth). Such reservation of resources may include an assignment of slots in priority queues of one or more routers, the use of a particular QoS flag in packet headers which may indicate that packets should be routed with a particular priority level, the establishment and/or use of a certain label-switched path with a guaranteed latency measure for packets of the visual communication session, and so forth.

In one example, AS 104 may establish a communication path such that media streams between device 131 and device 141 pass via AS 104, thereby allowing AS 104 to implement modifications to the visual content in accordance with the applicable configuration setting(s). The one or more configuration settings may be user-specified, may be based upon the capabilities of devices of user 191 and/or user 192 being used for the visual communication session, may be specific to the context (e.g., a particular task for which user 191 is obtaining assistance from user 192), and so forth. As just one example, device 131 may provide information regarding the capabilities and capacities of device 131 and camera 132 to AS 104 in connection with a request to establish a visual communication session with device 141. AS 104 may send a notification of the request to device 141. Similarly, device 141 may provide information regarding the capabilities and capacities of device 141 and camera 142 to AS 104 in connection with a response to the request/notification to establish the visual communication session.

In one example, the visual communication session may be a joint AR session, e.g., established privately among the users via AS 104 and/or the users' respective communication systems, or hosted by AS 104 for public or semi-public usage. In one example, device 131 and/or device 141 may indicate a purpose for the visual communication session (e.g., further context). For example, a particular task of installing a set-top box, performing physical therapy on a leg of user 191, etc. may be indicated. In this regard, AS 104 may maintain different settings to match to different contexts, e.g., different tasks. For instance, each task may have a different set of object detection and/or recognition models and a different set of action detection models. In addition, in one example, each task may have different workflows/maps comprising sequences of actions to be performed to complete an overall task. In one example, the system 100 supports the creation of object and action detection models and associated one or more configuration settings. For example, the configuration settings may map objects and object recognition models, and actions and action detection models, with applicable contexts to activate the object recognition models and action detection models (and corresponding modifications to visual content).

However, it should be noted that in one example, the present disclosure specifically does not adhere to a workflow and does not have a required sequence of actions. Rather, AS 104 may detect an action of an expert user and may then determine how closely another user adheres to the action of the expert user, e.g., without any preconception as to which action will be performed, which action will follow next, etc., or with only a loose set of expectations as to which actions might be performed. For instance, for a router repair task, an action of "throwing" may be considered, with high confidence, to be an unlikely action. Other types of actions that may have been performed for prior router repair tasks may be scanned against the video stream with some expectation that they may be performed, but without specific requirements that they are in fact performed.

To illustrate, in the example of FIG. 1, user 191 may seek assistance from user 192 regarding a task involving object A and tool B (e.g., another object comprising a tool for working on object A). A visual communication session may therefore be established by device 131 and/or device 141 via AS 104 and over access networks 120, 122, etc. Video/visual data of user 191 may be transmitted by device 131 to AS 104 and forwarded to device 141, which may be presented via window 147 of user interface 145. Similarly, video/visual data of user 192 may be transmitted by device 141 to AS 104 and forwarded to device 131, which may be presented via window 137 of user interface 135. In the present example, user 191 may be improperly using tool B, which may be seen by user 192 (e.g., an expert). User 192 may therefore demonstrate a proper technique using object C and tool D (e.g., another object comprising a tool for working on object C). For instance, object C may be the same as or similar to object A, and tool D may be the same as or similar to tool B. Device 141 may transmit video of user 192 to AS 104, which may forward the video to device 131 for presentation via window 137 of user interface 135.

In addition to facilitating two-way video transmissions, AS 104 may further detect and recognize objects, and actions and/or intentions with respect to objects, in both videos/visual data streams. For instance, user 191 may indicate that assistance is needed with respect to a particular task involving object A. In one example, AS 104 may, in response to the identification of the particular task, retrieve associated object detection models and action detection models (and, in one example, a task workflow). For instance, in one example, detecting objects and detecting actions may be in accordance with models associated with particular tasks that may be identified in advance or at the beginning of the interaction, e.g., if it is known that the second user has knee problems, it is expected to see and detect actions relating to the thigh and calf; if it is known that the second user is setting up a router, it is expected to see and detect a yellow cable being plugged in to an Ethernet port on the router, etc. Thus, there may be particular models for detecting particular objects or types of objects that are selectively utilized when the type of task is known. In one example, AS 104 may determine the "relevance" of objects in the videos based upon user intention (e.g., an action with respect to an object) or knowledge of the task and the objects associated with the task (e.g., tools that are customarily used, parts to assemble a final item/object, etc.).

To illustrate, AS 104 may store visual information of various objects as object detection and/or recognition models. This may include one or more images of objects (e.g., from different angles), and may alternatively or additionally include feature sets derived from one or more images of the respective objects. For instance, for an object type of a particular router, AS 104 may store a respective scale-invariant feature transform (SIFT) model, or a similar reduced feature set derived from image(s) of the router, which may be used for detecting routers in the visual information from camera 132, camera 142, device 131, device 141, etc., via feature matching (and similarly for other objects). Thus, in one example, a feature matching detection algorithm employed by AS 104 may be based upon SIFT features. However, in other examples, different feature matching detection algorithms may be used, such as a Speeded Up Robust Features (SURF)-based algorithm, a cosine-matrix distance-based detector, a Laplacian-based detector, a Hessian matrix-based detector, a fast Hessian detector, etc.
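
A minimal sketch of SIFT-based feature matching against a stored object model is shown below, using OpenCV. The function name, the match-count threshold, and the 0.75 ratio value are illustrative assumptions; the disclosure only contemplates SIFT (or SURF, Hessian-based, etc.) feature matching in general.

```python
import cv2

def matches_object_model(frame_gray, model_descriptors, min_good_matches=25):
    """Check one grayscale video frame against a precomputed SIFT model
    (e.g., descriptors derived offline from reference images of a particular router)."""
    sift = cv2.SIFT_create()
    _, frame_descriptors = sift.detectAndCompute(frame_gray, None)
    if frame_descriptors is None:
        return False
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(model_descriptors, frame_descriptors, k=2)
    # Lowe's ratio test keeps only distinctive matches
    good = [pair[0] for pair in knn
            if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance]
    return len(good) >= min_good_matches
```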

The visual features used for detection and recognition of objects may include low-level invariant image data, such as colors (e.g., RGB (red-green-blue) or CYM (cyan-yellow-magenta) raw data (luminance values) from a CCD/photosensor array), shapes, color moments, color histograms, edge distribution histograms, etc. Visual features may also relate to movements in a video and may include changes within images and between images in a sequence (e.g., video frames or a sequence of still image shots), such as color histogram differences or a change in color distribution, edge change ratios, standard deviation of pixel intensities, contrast, average brightness, and the like.

In one example, AS 104 may perform an image salience detection process, e.g., applying an image salience model and then performing an image recognition algorithm over the "salient" portion of the video frame(s) or other visual information. Thus, in one example, visual features may also include a length-to-width ratio of an object, a velocity of an object estimated from a sequence of images (e.g., video frames), and so forth. Similarly, in one example, AS 104 may apply an object detection and/or edge detection algorithm to identify possible unique items in video or other visual information (e.g., without particular knowledge of the type of item; for instance, the object/edge detection may identify an object in the shape of a tree in a video frame, without understanding that the object/item is a tree). In this case, visual features may also include the object/item shape, dimensions, and so forth. In such an example, object recognition may then proceed as described above (e.g., with respect to the "salient" portions of the image(s) and/or video(s)).

In one example, the detection and/or recognition of objects and their locations and/or positions in the visual data may be in accordance with one or more machine learning algorithms (MLAs), e.g., one or more trained machine learning models (MLMs). For instance, a machine learning algorithm (MLA), or machine learning model (MLM) trained via an MLA, may be for detecting a single object, or may be for detecting a single object from a plurality of possible objects that may be detected via the MLA/MLM. For instance, the MLA (or the trained MLM) may comprise a deep learning neural network, or deep neural network (DNN), such as a convolutional neural network (CNN), a generative adversarial network (GAN), a support vector machine (SVM), e.g., a binary, non-binary, or multi-class classifier, a linear or non-linear classifier, and so forth. In one example, the MLA/MLM may be a SIFT or SURF features-based detection model, as mentioned above. In one example, the MLA may incorporate an exponential smoothing algorithm (such as double exponential smoothing, triple exponential smoothing, e.g., Holt-Winters smoothing, and so forth), reinforcement learning (e.g., using positive and negative examples after deployment as an MLM), and so forth. It should be noted that various other types of MLAs and/or MLMs may be implemented in examples of the present disclosure, such as k-means clustering and/or k-nearest neighbor (KNN) predictive models, support vector machine (SVM)-based classifiers, e.g., a binary classifier and/or a linear binary classifier, a multi-class classifier, a kernel-based SVM, etc., a distance-based classifier, e.g., a Euclidean distance-based classifier, or the like, and so on. In one example, the object detection and/or recognition MLM(s) may be trained at a network-based processing system (e.g., AS 104, or the like). It should also be noted that various pre-processing or post-recognition/detection operations may also be applied. For example, AS 104 may apply an image salience algorithm, an edge detection algorithm, or the like (e.g., as described above), where the results of these algorithms may include additional, or pre-processed, input data for the one or more MLAs. Thus, in the example of FIG. 1, AS 104 may apply any number of image pre-processing algorithms to videos, and may apply at least one object detection/recognition MLA to detect objects A, B, C, D, E, etc. of FIG. 1 from among various types of detectable objects in accordance with the one or more MLAs applied by and in operation on AS 104.
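
As one concrete illustration of an MLM-based detector, the sketch below runs a pretrained deep-learning detector over a single frame and returns confident detections with their bounding boxes. The choice of a generic pretrained Faster R-CNN model and the score threshold are assumptions for illustration; the disclosure contemplates task-specific models (e.g., for a particular router or STB) rather than this particular network.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Pretrained detector standing in for the task-specific object detection MLM(s)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def detect_objects(frame_rgb, score_threshold=0.7):
    """Return (label_id, score, box) tuples for confident detections in one frame."""
    with torch.no_grad():
        out = model([to_tensor(frame_rgb)])[0]
    return [(int(label), float(score), box.tolist())
            for label, score, box in zip(out["labels"], out["scores"], out["boxes"])
            if score >= score_threshold]
```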

Similarly, in one example, an action detection model, or "signature," may be created that represents a particular action (e.g., where different actions may each have a respective model). The action detection model may comprise a machine learning algorithm (MLA), or machine learning model (MLM) trained via the MLA, which may comprise, for example, a deep learning neural network, or deep neural network (DNN), a generative adversarial network (GAN), a support vector machine (SVM), e.g., a binary, non-binary, or multi-class classifier, a linear or non-linear classifier, and so forth. In one example, the MLA may incorporate an exponential smoothing algorithm (such as double exponential smoothing, triple exponential smoothing, e.g., Holt-Winters smoothing, and so forth), reinforcement learning (e.g., using positive and negative examples after deployment as an MLM), and so forth. It should be noted that various other types of MLAs and/or MLMs may be implemented in examples of the present disclosure, such as k-means clustering and/or k-nearest neighbor (KNN) predictive models, support vector machine (SVM)-based classifiers, e.g., a binary classifier and/or a linear binary classifier, a multi-class classifier, a kernel-based SVM, etc., a distance-based classifier, e.g., a Euclidean distance-based classifier, or the like, and so on. In one example, the signature may include those features which are determined to be the most distinguishing features of the action, e.g., those features which are quantitatively the most different from what is considered statistically normal or average from visual content associated with a given participant, a group of participants, a given context, and/or in general, e.g., the top 20 features, the top 50 features, etc.

In one example, an action detection model, or "signature," may be created that represents multiple detected actions having a threshold similarity. In other words, the multiple detected actions are considered to be unique occurrences of a same action, or a same type of action. For instance, the action signature may comprise a machine learning model (MLM) that is trained based upon the plurality of features from a plurality of the same and/or similar events. For example, each of the similar events may comprise a set of features used as a positive example that is applied to a machine learning algorithm (MLA) to generate the action signature (e.g., an MLM). In one example, the positive examples used to train the MLM may be determined to be "similar" in accordance with an unsupervised, supervised, and/or semi-supervised clustering algorithm. In one example, the event detection model may be represented as an MLM comprising the average features of a cluster of the plurality of similar events in a feature space, a cluster centroid, or the like.

In one example, the action detection model (e.g., an MLM) may be applied to process outbound and/or inbound visual content and to identify patterns in the features of the visual content that match the action detection model/signature. In one example, a match may be determined using any of the visual features and/or other features mentioned above. For instance, a match may be determined when there is a threshold measure of similarity among the features of the visual content and the action detection model. In one example, the threshold measure of similarity may alternatively or additionally include matching additional features associated with measurements from wearable devices and/or other sensors. In one example, the features from the visual content and/or additional features may be analyzed using a time-based sliding window. Thus, the next time there is a similar sequence of events, e.g., similar imagery and/or movements as recorded by wearable devices and/or other sensors, it may be associated with the action type and may be identified as a potential additional occurrence of the same action.
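
Where the signature is represented as a cluster centroid in a feature space, matching a sliding window of recent features against it can be sketched as below. The mean-pooling step and the cosine-similarity threshold are illustrative assumptions, not the disclosure's required matching rule.

```python
import numpy as np

def matches_action_signature(feature_window: np.ndarray,
                             signature: np.ndarray,
                             threshold: float = 0.85) -> bool:
    """Compare a time-based sliding window of per-frame/sensor features against
    a stored action signature (e.g., cluster centroid or averaged features).

    feature_window: (frames, features) array from the recent video/sensor stream.
    """
    pooled = feature_window.mean(axis=0)  # summarize the window into one vector
    cosine = float(np.dot(pooled, signature) /
                   (np.linalg.norm(pooled) * np.linalg.norm(signature) + 1e-9))
    return cosine >= threshold
```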

In the present example, object A and tool B may be detected via object detection and/or recognition models and determined to be relevant based upon the movement of tool B by user 191 detected via an action detection model and/or based upon the association of tool B with the task involving object A. Similarly, object C and tool D may be detected via object recognition models and determined to be relevant based upon the movement of tool D by user 192 detected via an action detection model and/or based upon the association of tool D with the task involving object C. Alternatively, or in addition, object C and tool D may be determined to be relevant based upon a prior determination that object A and tool B are relevant and a determination that object A and tool B are congruent with object C and tool D, respectively. In one example, relevance of any or all of objects A and C and tools B and D may be further determined based upon speech detection and analysis identifying particular objects. For instance, user 191 may ask, "How hard do I turn the screwdriver on the back of the router?" If tool B is a screwdriver and object A is a router, then these items may be further determined by AS 104 to be relevant, or the relevance confirmed, via the detection of at least the words "screwdriver" and "router" in the speech of user 191.

It should be noted that in one example, AS 104 may further determine any irrelevant objects, or specific sensitive objects, that may be removed or obscured from the video/visual feed of one user prior to being forwarded to the device of the other user. For example, object E may be a stack of papers which may be identified by AS 104 as a "stack of papers" via an object detection/recognition model and analysis of the video from device 141, or may simply be detected as an object that is not relevant (e.g., if it is not an object having an object detection model associated with the task and is not otherwise determined to be relevant based upon the action(s)/intention(s) of user 192, such as by picking up or touching the stack of papers). In this case, AS 104 may determine that object E is irrelevant and may dynamically blur object E from the video forwarded to device 131. Thus, as illustrated in FIG. 1, the window 137 of user interface 135 shows obscured area 182 where object E would otherwise be seen. However, in another example, object E may be excluded entirely from the video presented in window 137.
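
Dynamically blurring an irrelevant or sensitive object's region before forwarding a frame can be sketched as below; the box format and kernel size are assumptions for illustration.

```python
import cv2

def blur_region(frame_bgr, box, ksize=51):
    """Obscure an irrelevant/sensitive object (e.g., object E, a stack of papers)
    in-place before the frame is forwarded to the other party.

    box: (x, y, w, h) bounding box of the object; ksize must be an odd kernel size.
    """
    x, y, w, h = box
    roi = frame_bgr[y:y + h, x:x + w]
    frame_bgr[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (ksize, ksize), 0)
    return frame_bgr
```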

As noted above, there may be particular models for detecting particular types of actions and quantifying the actions when the type of task is known. In one example, an action may be detected by sensors of force feedback gloves 133 and 134 and force feedback gloves 143 and 144 that sense force in the fingers, palm, etc., but also X, Y, Z movement/displacement, acceleration, or the like via a gyroscope and compass, or similar sensors. Accordingly, in one example, objects may be known via object detection/recognition, and then the action(s) quantified via the force feedback glove(s). Alternatively, or in addition, actions may be quantified via video analysis to identify a displacement of an object, e.g., between video frames, or similarly for acceleration, rotation, etc.
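
A minimal sketch of quantifying an action from tracked object poses in consecutive frames follows. The pose fields, units, and returned metric names are assumptions for illustration; in practice the measurements could come from the glove sensors, video analysis, or both.

```python
import math

def quantify_action(prev_pose, curr_pose, dt):
    """Quantify an action from object poses in consecutive frames.

    Each pose is (x, y, z, yaw_degrees); dt is the time between frames in seconds.
    """
    dx, dy, dz = (curr_pose[i] - prev_pose[i] for i in range(3))
    displacement = math.sqrt(dx * dx + dy * dy + dz * dz)
    rotation = curr_pose[3] - prev_pose[3]
    velocity = displacement / dt if dt > 0 else 0.0
    return {"displacement": displacement, "rotation": rotation, "velocity": velocity}
```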

In one example, AS 104 may detect a difference between an action of user192 and an action of user 191 and provide a notification indicative ofthe difference. For example, the actions may be quantified as describedabove. In addition, when the two actions are the same or similar, andare determined to relate to similar objects (e.g., rotations of objectsA and C), the quantified actions may be compared and the difference(s)recorded. For example, if tool B is moved 10 inches and tool D is moved6 inches, the difference of 4 inches may be recorded. Similarly, if thepressure in a finger sensor of force feedback glove 143 is recorded as45 psi and the pressure in a finger sensor of force feedback glove 133is recorded as 30 psi while holding objects A and C, respectively, thedifference of 15 psi may be recorded. Thus, in various examples, thedifference may be “difference in displacement,” “difference inrotation,” “difference in force in index finger and thumb,” “differencein grip force,” “difference in contact in 4^(th) and 5^(th) fingers ofright hand,” etc. In addition, corresponding notifications indicative ofthe difference may simply report the difference(s), or may provide arecommendation, such as “insufficient displacement” (or “move tool B alittle further”), “insufficient rotation” (or “rotate object Aapproximately 90 degrees”), “insufficient force in index finger andthumb,” “insufficient grip force,” “lack of contact in 4^(th) and 5^(th)fingers of right hand” (or “make contact with object A in 4^(th) and5^(th) fingers of right hand”), etc.

In one example, AS 104 may send the notification or recommendation to one or both of users 191 and 192 via devices 131 and/or 141. However, in one example, it may be transmitted to user 192 (who may be an expert), who may then decide whether the recommendation is actionable (where the notification is not transmitted to user 191). If determined to be actionable, user 192 may then choose to instruct user 191 accordingly, e.g., verbally via the video/visual feed from device 141 that is sent to device 131 via the visual communication session. For instance, differences may be alerted to user 192, but the user 192 may determine that the differences are inconsequential and may be ignored, or may choose to take note of any substantial differences that may call for correction of the action(s) of user 191. For instance, as illustrated in FIG. 1, user 191 may be holding tool B in a steady position, whereas user 192 may be demonstrating a proper technique using the similar/corresponding tool D, e.g., waving tool D, rotating tool D, or the like. Any detected differences at such initial demonstrations may be deemed to be inconsequential, and the alerts can be ignored or even suppressed based on user input.

In one example, user 192 may tangibly and palpably demonstrate a proper technique for user 191 using tactile feedback and control, by actions captured via sensors of force feedback gloves 143 and 144 and translated into control signals for generating corresponding actions via actuators of force feedback gloves 133 and 134. For instance, user 192 may verbally instruct user 191 to place tool B on the back of object A and hold the position. Then, user 192 may activate remote control functionality via AS 104. For instance, user 192 may speak a command such as "remote glove control" or "expert takeover," which may be interpreted by AS 104 and recognized as a command to activate remote control. Alternatively, or in addition, user 192 may perform a gesture via one or both of the force feedback gloves 143 and 144 to similarly provide such a command (such as tapping the left index finger four times, squeezing the 4th finger and thumb together two times, etc.). In one example, user 191 may provide a consent or confirmation to allow remote takeover, such as making a similar gesture, or speaking a verbal assent/confirmation, such as "ready for takeover," "takeover go," etc., that may be captured by a microphone of device 131 and included in or accompanying the video/visual data stream to be received and recognized by AS 104.

Thus, user 192 may cause the proper action(s) to be performed with respect to object A, e.g., using tool B via remote control of force feedback gloves 133 and 134 using force feedback gloves 143 and 144. For instance, user 192 may arrange object C and tool D in substantially the same positions as object A and tool B are arranged by user 191 immediately before and at the time of the remote takeover. Then, user 192 may perform one or more actions on object C using tool D, which results in corresponding actions on object A using tool B via the remote control of force feedback gloves 133 and 134 using force feedback gloves 143 and 144.

The foregoing describes an example of a network-based application via AS 104. However, it should be understood that in other, further, and different examples, the same or similar functionality may alternatively or additionally be applied locally, e.g., at device 131 and/or at device 141, such as in a peer-to-peer session that does not involve AS 104, or where some (but not all) functions of AS 104 may instead be performed by device 131 and/or device 141 (such as object detection, relevance detection, voice command interpretation, etc.). It should also be noted that the foregoing is described primarily in connection with wearable devices 131 and 141. However, in other, further, and different examples, other devices, such as cameras 132 and 142 (e.g., non-wearables), may be utilized in connection with a visual communication session. For instance, for an intricate carving process, camera 142 may be situated above a tabletop where user 192 is working and pointed downward to capture the hands of user 192 and the object(s) being worked on or worked with, etc. In one example, a user may switch views from a camera of a wearable device to an external camera, or the remote user on the other end of the visual communication session, such as an expert user, may choose to change which feed is received (or to receive and present both feeds simultaneously in different windows of the user interface). In addition, the example of FIG. 1 illustrates just one example of how two-way AR content may be visually presented for respective users via user interfaces 135 and 145. For instance, FIG. 2 illustrates another example of how two-way AR content may be visually presented for respective users. Thus, these and other modifications are all contemplated within the scope of the present disclosure.

It should also be noted that the system 100 has been simplified. Thus, it should be noted that the system 100 may be implemented in a different form than that which is illustrated in FIG. 1, or may be expanded by including additional endpoint devices, access networks, network elements, application servers, etc., without altering the scope of the present disclosure. In addition, system 100 may be altered to omit various elements, substitute elements for devices that perform the same or similar functions, combine elements that are illustrated as separate devices, and/or implement network elements as functions that are spread across several devices that operate collectively as the respective network elements. For example, the system 100 may include other network elements (not shown) such as border elements, routers, switches, policy servers, security devices, gateways, a content distribution network (CDN), and the like. For example, portions of network 102, access networks 120 and 122, and/or the Internet may comprise a content distribution network (CDN) having ingest servers, edge servers, and the like for packet-based streaming of video, audio, or other content. Similarly, although only two access networks, 120 and 122, are shown, in other examples, access networks 120 and/or 122 may each comprise a plurality of different access networks that may interface with network 102 independently or in a chained manner. Thus, these and other modifications are all contemplated within the scope of the present disclosure.

FIG. 2 illustrates an example of a visual communication session 200 between a first user 201 and a second user 202 having respective wearable AR devices 210 and 220. Visible space 280 may represent the area or region that may be recorded via AR device 210. Similarly, visible space 290 may represent the area or region that may be recorded via AR device 220. AR environment 285 may comprise a combination of real-world visible objects and AR overlay content that are perceptible to the user 201, and in one example may have the same size and/or shape as visible space 290. Thus, for example, real objects 281 and 282, as well as hands wearing force feedback gloves 283, may be visible in the AR environment 285. The outlined content 289 may represent overlaid AR content, e.g., “virtual objects” presented by a projector of AR device 210, which may comprise video from AR device 220 comprising objects from the visible space 290, and which may be presented with some opacity such that the outlined content 289 is visible but does not completely obscure aspects of the physical environment. In the same manner, AR environment 295 may comprise a combination of real-world visible objects and AR overlay content/virtual objects that are perceptible to the user 202. Thus, for example, real objects 291 and 292, as well as hands wearing force feedback gloves 293, may be visible in the AR environment 295. The outlined, overlay content 299 in the AR environment 295 may represent overlaid AR content, e.g., virtual objects presented by a projector of AR device 220, which may comprise video from AR device 210 comprising objects from the visible space 280. For example, user 201 may be an expert demonstrating a proper technique of using a tool (e.g., object 282) on object 281. User 201 may see that user 202 has positioned the corresponding tool (e.g., object 292) incorrectly by virtue of the overlay content 289 presented in AR environment 285. Likewise, user 202 may see that user 202 has positioned the tool (e.g., object 292) incorrectly by virtue of the overlay content 299 presented in AR environment 295. Notably, in the example of FIG. 2, while the objects 281 and 291 are not necessarily aligned perfectly, the difference may be sufficiently insubstantial such that no correction may be necessary, or it may be at least not worth drawing attention to the difference.
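
Purely as an illustrative sketch (not part of the disclosed system), the partial-opacity presentation of overlay content such as 289 or 299 could be approximated by alpha blending the remote party's video onto the local camera frame. The function name, the opacity value, and the assumption that both frames are equally sized RGB arrays are hypothetical:

import numpy as np

def composite_overlay(local_frame: np.ndarray,
                      remote_frame: np.ndarray,
                      opacity: float = 0.4) -> np.ndarray:
    """Blend the remote party's video (overlay content) onto the local view
    with partial opacity, so virtual objects remain visible without fully
    obscuring the physical environment."""
    # Both frames are assumed to be same-sized uint8 RGB arrays.
    blended = ((1.0 - opacity) * local_frame.astype(np.float32)
               + opacity * remote_frame.astype(np.float32))
    return blended.clip(0, 255).astype(np.uint8)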

In another example, to simplify demonstration for the learning user 202, the display of AR environment 295 may perfectly align (e.g., “snap” or “lock”) some overlay content of user 201 (e.g., the portion of overlay content 299 representing the force feedback gloves 283) with the content of user 202 (e.g., the force feedback gloves 293) to exactly align hand position while explicitly demonstrating the incorrect orientation of the manipulated objects (portion(s) of overlay content 299 representing object 282 or 281 with actual object 292 or 291). At the same time, the view for the expert or leading user 201 in AR environment 285 may have the portion of the overlay content 289 representing force feedback gloves 283 that are not aligned or snapped to match. This example may allow improved instruction of technique from the first user 201 while simultaneously simplifying the manipulation of objects by the second user 202. In yet another example, although not depicted in FIG. 2, both AR environments 285 and 295 (e.g., the users’ views) may first be aligned on a specific object of interest, such as object 281 and object 291, which may be a focal point of the task (e.g., a mobile phone or router), attached to an additional object (e.g., the door of a car), or an immovable object (e.g., a heating, ventilation, and air conditioning (HVAC) condenser). The system may first automatically align a portion of overlay content 289 representing object 291 with object 281, and a portion of overlay content 299 representing object 281 with object 291, as a first step in the process. This first alignment process may guarantee that both users 201 and 202 are correctly oriented for the object regardless of the positions in their respective physical environments.
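
As a hedged illustration of the “snap”/“lock” alignment described above, the offset between the local object of interest and its overlaid counterpart could be computed from their detected positions and then applied to all overlay content. The names below (ObjectPose, snap_offset) are invented for this sketch and do not appear in the disclosure:

from dataclasses import dataclass
from typing import Tuple

@dataclass
class ObjectPose:
    x: float  # detected object center, pixels
    y: float

def snap_offset(local_obj: ObjectPose, overlay_obj: ObjectPose) -> Tuple[float, float]:
    """Translation that 'snaps' the overlaid rendering of the remote object
    (e.g., object 281 as shown in AR environment 295) onto the local object
    of interest (e.g., object 291)."""
    return (local_obj.x - overlay_obj.x, local_obj.y - overlay_obj.y)

# Hypothetical usage: every element of the overlay is shifted by the same
# offset, so both users start out oriented around the shared object.
dx, dy = snap_offset(ObjectPose(640.0, 360.0), ObjectPose(500.0, 410.0))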

FIG. 3 illustrates a flowchart of an example method 300 for providing a notification indicative of a detected difference in actions relating to a first object in a first video and a second object in a second video of a visual communication session, in accordance with the present disclosure. In one example, the method 300 is performed by a component of the system 100 of FIG. 1, such as by application server 104, device 131, or device 141, and/or any one or more components thereof (e.g., a processor, or processors, performing operations stored in and loaded from a memory), or by application server 104 in conjunction with one or more other devices, such as device 131, device 141, and so forth. In one example, the steps, functions, or operations of method 300 may be performed by a computing device or system 400, and/or processor 402 as described in connection with FIG. 4 below. For instance, the computing device or system 400 may represent any one or more components of application server 104, device 131, or device 141 in FIG. 1 that is/are configured to perform the steps, functions and/or operations of the method 300. Similarly, in one example, the steps, functions, or operations of method 300 may be performed by a processing system comprising one or more computing devices collectively configured to perform various steps, functions, and/or operations of the method 300. For instance, multiple instances of the computing device or processing system 400 may collectively function as a processing system. For illustrative purposes, the method 300 is described in greater detail below in connection with an example performed by a processing system. The method 300 begins in step 305 and may proceed to optional step 310 or to step 320.

At optional step 310, the processing system may receive a request to establish a communication session (e.g., a visual communication session) from at least one of a first communication system of a first user or a second communication system of a second user. The processing system may include at least one processor deployed in a communication network. The processing system may alternatively or additionally comprise the first communication system of the first user, the second communication system of the second user, and/or network-based components. The communication session may be for a video call, a two-way AR session, or the like. The first communication system and the second communication system may comprise wearable devices, such as AR devices 210 and 220 of FIG. 2, devices 131 and 141 of FIG. 1, or the like. In one example, the first communication system and the second communication system may include at least a first force feedback glove and at least a second force feedback glove, respectively.

At optional step 315, the processing system may establish a communication session between at least a first communication system of a first user and a second communication system of a second user, the communication session including a first video of the first user and a second video of the second user (e.g., to be exchanged between the users’ respective communication systems). It should also be noted that although the terms “first,” “second,” “third,” etc., are used herein, the use of these terms is intended as labels only. Thus, the use of a term such as “third” in one example does not necessarily imply that the example must in every case include a “first” and/or a “second” of a similar item. In other words, the use of the terms “first,” “second,” “third,” and “fourth” does not imply a particular number of those items corresponding to those numerical values. In addition, the use of the term “third,” for example, does not imply a specific sequence or temporal relationship with respect to a “first” and/or a “second” of a particular type of item, unless otherwise indicated.

At step 320, the processing system detects a first object in the first video of the first user. In one example, the detecting of the first object is via an object recognition model (e.g., a visual object recognition model that uses visual data of the first video as input and which outputs an indication of whether or not an object of a particular type associated with the object recognition model is present). In one example, the detecting of the first object is further in accordance with user account information of the second user, an indication of a task associated with the first object, and/or other factors. For instance, if it is known that the first user is seeking assistance with a particular task, and the task is known to be associated with one or more specific objects (e.g., including at least the first object), then the processing system may monitor the first video for those one or more specific objects. For example, one or more object recognition models that are specific to the object(s) associated with the task may be activated for scanning against the visual content. Similarly, if it is known (e.g., from user account information) that a user has a particular type of equipment, such as a particular type of television, a particular router make and model, or the like, then the processing system may be configured to specifically look for and detect such object(s) via respective object detection model(s) for such object(s).
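
For illustration only (not the claimed method), the task-conditioned activation of recognition models can be sketched as a registry of detectors keyed by object type, where only the detectors implied by the task and by the user's account information are run against each frame. The registry, the task mapping, and the stub detector below are all hypothetical:

from typing import Callable, Dict, Optional, Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height

def _stub_detector(frame) -> Optional[Box]:
    """Placeholder standing in for a trained per-object recognition model."""
    return None

# Hypothetical registry of object recognition models, keyed by object type.
DETECTORS: Dict[str, Callable] = {
    "router_model_x": _stub_detector,
    "phillips_screwdriver": _stub_detector,
}

# Hypothetical mapping from known tasks to the objects associated with them.
TASK_OBJECTS: Dict[str, Tuple[str, ...]] = {
    "router_setup": ("router_model_x", "phillips_screwdriver"),
}

def detect_task_objects(frame, task: str, account_equipment=()) -> Dict[str, Box]:
    """Run only the recognition models tied to the known task and to
    equipment listed in the user's account information."""
    candidates = set(TASK_OBJECTS.get(task, ())) | set(account_equipment)
    hits: Dict[str, Box] = {}
    for name in candidates:
        detector = DETECTORS.get(name)
        if detector is not None:
            result = detector(frame)
            if result is not None:
                hits[name] = result
    return hits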

At step 325, the processing system detects a second object in the second video of the second user (e.g., where the first video and the second video are part of the visual communication session between the first user and the second user). In one example, the detecting of the second object is via a same object recognition model that is used at step 320. For instance, the first object and the second object may be a same type of object. In one example, the first object and the second object may be a same type of body part of the first user and the second user. In another example, the first object and the second object may be a same type of equipment, e.g., electronic equipment, mechanical equipment, and so forth. In one example, the detecting of both the first object and the second object may be further in accordance with at least one of: user account information of the second user or an indication of a task associated with the first object and the second object.

At step 330, the processing system detects a first action in the first video relative to the first object. For instance, the processing system may scan the first video against active action detection models for detecting particular types of actions and for quantifying the actions when the type of task is known. For instance, types of actions that may be expected and that may be detected via the one or more action detection models may be “throwing,” “swinging,” “chopping,” “twisting,” etc. In one example, the action detection model(s) may be specific to particular objects, e.g., “twisting screwdriver,” “twisting rope,” etc. Thus, the processing system may detect the first action via a first action detection model for detecting the particular type of action. In one example, the action may be detected by sensors of force feedback gloves worn by the first user that sense force in fingers, palm, etc., as well as 3D movement/displacement, acceleration, or the like via gyroscope, compass, or similar sensors. Thus, the first object may be detected via an object detection/recognition model at step 320, and then the first action may be quantified via the force feedback glove(s) at step 330. Alternatively, or in addition, the first action may be quantified via video analysis to identify a displacement of the first object, e.g., between frames of the first video, and similarly for acceleration, rotation, etc.
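
The quantification described in this step could, as one non-authoritative sketch, combine frame-to-frame displacement of the detected object with a simple summary of force feedback glove readings. The frame rate and the sample field names ("finger_force", "rotation") are assumptions for the example:

import math

def displacement_per_frame(prev_box, cur_box, fps: float = 30.0):
    """Estimate object displacement (pixels) and speed (pixels/second)
    from bounding-box centers in consecutive frames."""
    (px, py, pw, ph), (cx, cy, cw, ch) = prev_box, cur_box
    dx = (cx + cw / 2) - (px + pw / 2)
    dy = (cy + ch / 2) - (py + ph / 2)
    distance = math.hypot(dx, dy)
    return distance, distance * fps

def summarize_glove(samples):
    """Summarize force feedback glove readings over the span of an action;
    each sample is assumed to be a dict with 'finger_force' (newtons) and
    'rotation' (degrees, e.g., from gyroscope/compass sensors)."""
    return {
        "peak_force": max(s["finger_force"] for s in samples),
        "total_rotation": sum(s["rotation"] for s in samples),
    }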

Accordingly, it should be noted that in various examples, the first action in the first video relative to the first object may comprise a manipulation of the first object detected via at least one of the first video or at least a first force feedback glove of the first user, or a manipulation of at least a third object (such as a tool, an integral/constituent part of the first object, etc.) relative to the first object detected via at least one of the first video or at least the first force feedback glove. In addition, the “manipulation” may comprise movement (e.g., displacement or rotation), squeezing, pressing, etc., and may be quantified in terms of distance, force applied, number of turns, or multiple aspects, depending upon the type of action and what is detectable via the system capabilities in terms of visibility from camera perspective, available sensor data, etc.

In one example, the first action detection model comprises a machine learning model (MLM) for detecting the first action. The MLM may identify, from the first video, features of the first user (and/or the first object) that distinguish the first action from visual content that does not contain the first action. In one example, the features are from a feature space comprising quantified aspects of the visual content. Quantified aspects may include low-level invariant image data, features relating to movement in a video, e.g., changes within images and between images, recognized objects (e.g., including parts of a human body such as legs, arms, hands, etc.), a length-to-width ratio of an object, a velocity of an object estimated from a sequence of images (e.g., video frames), and so forth. In one example, features may additionally be taken from wearable device inputs, such as gyroscope and compass measurements from various points of a human body (e.g., force feedback glove(s) and/or sensors positioned on other parts of the body, such as elbows, shoulders, etc.), eye movements, and so forth.
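
A minimal sketch of the kind of quantified feature vector such an MLM might consume is shown below, combining simple frame-difference statistics with optional wearable readings. The specific features, the threshold, and the glove key names are illustrative assumptions, not the trained model of the disclosure:

import numpy as np

def action_features(prev_frame: np.ndarray, cur_frame: np.ndarray,
                    glove_sample=None) -> np.ndarray:
    """Assemble a toy feature vector for an action-detection MLM from
    frame-to-frame change statistics plus optional glove sensor readings."""
    # Cast to int16 before subtracting to avoid uint8 wrap-around.
    diff = np.abs(cur_frame.astype(np.int16) - prev_frame.astype(np.int16))
    features = [diff.mean(), diff.std(), (diff > 30).mean()]
    if glove_sample is not None:
        # e.g., gyroscope and force readings from a force feedback glove
        features.extend([glove_sample["gyro_x"], glove_sample["gyro_y"],
                         glove_sample["gyro_z"], glove_sample["force"]])
    return np.asarray(features, dtype=np.float32)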

The first action detection model/MLM can be trained from input of other users regarding actions by various other users. In one example, the first action detection model may be activated by the processing system for detection when the task is known. In one example, the processing system may select the first action detection model for active use when one or more context criteria are met. For instance, the processing system may activate the action detection model when the context includes one or more of: a physical location of the first user, a physical location of the second user, a time of day, a type of task for the communication session, a topic of the communication session, and so forth. In one example, the context may be that the first user has provided an input to the processing system indicating the task, or task type, a problem with a particular type of equipment, etc., and/or may include the same or similar information obtained via a user profile, user account record, or the like.

At step 335, the processing system detects a second action in the second video relative to the second object. In one example, step 335 may be the same as or similar to step 330, as discussed above. For instance, the second action may be a same type of action as the first action (and may be with respect to the second object, which may be the same as or similar to the first object). In various examples, the second action in the second video relative to the second object may comprise at least one of: a manipulation of the second object detected via at least one of the second video or at least a second force feedback glove of the second user, or a manipulation of at least a fourth object (such as a tool, an integral/constituent part of the second object, etc.) relative to the second object detected via at least one of the second video or the at least the second force feedback glove. Thus, it should be noted that in one example, the first user engages in the first action via at least a first force feedback glove, and the second user engages in the second action via at least a second force feedback glove.

At optional step 340, the processing system may detect additional objects in the first video and/or the second video, e.g., at least a “fifth” object in the second video and/or at least a “sixth” object in the first video, etc. In one example, optional step 340 may be the same as or similar to steps 320 and 325. Alternatively, or in addition, optional step 340 may comprise detecting one or more objects without any specific determination of the type of object(s). For instance, optional step 340 may identify one or more objects by distinguishing such objects from each other, from the first and second objects identified at steps 320 and 325, and from other aspects of the respective environments, such as via edge detection algorithm(s), or the like.

At optional step 345, the processing system may determine whether any additional detected objects are relevant (or not relevant) to the first action or the second action, respectively. For instance, objects may be detected and/or recognized via object detection and/or recognition models at optional step 340 and may be determined to not be relevant to a task, e.g., where the task is associated with “relevant” objects, and where other objects may thus be categorized as “not relevant.” Alternatively, or in addition, objects may be detected in a general sense, e.g., determining the outlines of objects to distinguish from other objects, without any particular determination as to what the object is. In this case, some objects may be recognized, such as the first object and the second object, via one or more object recognition models. However, other objects, while detected and distinguished from other objects, may not be recognized. In one example, unrecognized objects may be categorized as “not relevant,” unless there is a particular action of the first user or the second user directed toward an unrecognized object, in which case, even though the object is unrecognized, the object may be categorized as “relevant” to the task. For example, at least a “fifth” object in the second video and/or at least a “sixth” object in the first video may be determined to be “not relevant” at optional step 345.
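
As a hedged sketch of this relevance determination, detected objects could be labeled using the two signals the text identifies: membership in the set of task-associated objects, and whether a user is currently interacting with the object. The data shapes below are assumptions made only for the example:

def classify_relevance(detected, task_objects, interacted_with):
    """Label each detected object as 'relevant' or 'not relevant'.
    detected: mapping of object id -> recognized type (or None if unrecognized)
    task_objects: set of object types associated with the task
    interacted_with: set of object ids a user is acting upon."""
    labels = {}
    for obj_id, recognized_type in detected.items():
        if recognized_type in task_objects or obj_id in interacted_with:
            labels[obj_id] = "relevant"
        else:
            # Unrecognized, untouched objects default to not relevant.
            labels[obj_id] = "not relevant"
    return labels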

At optional step 350, the processing system may obscure one or more objects that are determined to not have relevance to the first action or the second action, respectively (for instance, at least a “fifth” object in the second video and/or at least a “sixth” object in the first video). The obscuring may include blocking out a region of the first video or the second video in which the non-relevant object(s) are located, blurring, replacing with another image or images (such as a logo, a “redacted” sign, etc.), and so forth. The modification(s) of optional step 350 may be selected by one or both of the users, may be defined in connection with a default or standard profile that may be selected by or for a user, and/or may be associated with the particular task.
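
One simple, illustrative way to realize the blocking-out/blurring option is to pixelate the bounding box of the non-relevant object before the video is transmitted. This numpy-only sketch assumes the frame is an RGB array and that the box coordinates come from the detection of optional step 340; it is not the disclosure's implementation:

import numpy as np

def obscure_region(frame: np.ndarray, box, block: int = 16) -> np.ndarray:
    """Pixelate the region of a non-relevant object so it remains masked
    in the transmitted video (blurring or image replacement would work
    equally well)."""
    x, y, w, h = box
    roi = frame[y:y + h, x:x + w]
    # Downsample, then repeat each coarse pixel to pixelate the region.
    coarse = roi[::block, ::block]
    coarse = np.repeat(np.repeat(coarse, block, axis=0), block, axis=1)
    frame[y:y + h, x:x + w] = coarse[:h, :w]
    return frame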

At optional step 355, the processing system may extract relevant objects from the first video and the second video. For instance, the first video may be modified to exclude background and/or non-relevant objects. In other words, the relevant objects may remain in the first video and the second video, respectively, while extraneous visual content may be removed.

At optional step 360, the processing system may transmit the first video to a second communication system of the second user and the second video to a first communication system of the first user (e.g., with any modifications as per optional step 350 or optional step 355).

At step 365, the processing system detects a difference between the first action and the second action. For instance, actions may be quantified/measured in terms of movement (e.g., displacement or rotation), squeezing, pressing, etc., and may be quantified in terms of applied force, number of turns, or multiple aspects, depending upon the type of action and what is detectable via the system capabilities in terms of visibility from camera perspective, force feedback glove capabilities, other available sensor data, etc. Accordingly, the differences in the actions may be determined by comparing and noting the differences in the corresponding measurements on the side of the first user and the side of the second user. It should be noted that in one example, the detecting of the difference between the first action and the second action may comprise detecting a difference between a manipulation of the first object and a manipulation of the second object, or detecting a difference between a manipulation of at least the third object relative to the first object and a manipulation of at least the fourth object relative to the second object (e.g., where the third and/or fourth objects may be those discussed above with respect to steps 330 and 335).
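
A minimal sketch of the comparison is shown below, assuming each action has already been reduced to named measurements (e.g., rotation in degrees, applied force in newtons) and that a per-metric tolerance decides when a difference is worth reporting. The metric names and tolerance values are hypothetical:

def action_differences(first, second, tolerances):
    """Compare measurements of the first and second actions and report
    only the metrics whose difference exceeds a per-metric tolerance."""
    diffs = {}
    for metric, tol in tolerances.items():
        if metric in first and metric in second:
            delta = second[metric] - first[metric]
            if abs(delta) > tol:
                diffs[metric] = delta
    return diffs

# Hypothetical usage:
# action_differences({"rotation_deg": 180, "force_n": 4.0},
#                    {"rotation_deg": 90,  "force_n": 4.2},
#                    {"rotation_deg": 15,  "force_n": 1.0})
# -> {"rotation_deg": -90}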

At step 370, the processing system provides a notification indicative of the difference. In one example, the notification indicative of the difference may simply report the difference(s). In one example, the notification may comprise a recommendation for an adjustment associated with the second action based on the difference, such as “insufficient displacement,” “move tool B a little further,” etc. In one example, the processing system may send the notification or recommendation to one or both of the users via the respective communication systems. However, in one example, the notification may be sent to an expert user (e.g., the first user) who may then decide whether the notification is actionable (e.g., where the notification is not sent to the second user). In one example, the notification may be presented audibly, e.g., by inserting audio into the first and/or second video, or by sending an audio side-stream, and/or via a respective AR display or displays of the first and/or second communication systems, etc. In one example, the notification may comprise a recommendation in the form of a visual indication relative to at least the second object via an augmented reality display of the second user. For example, the processing system may add to the first video an arrow pointing at the correct location(s) to align the second object with a tool, a highlighting on a particular object (e.g., the second object) or part of the object, etc.
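
As an illustrative continuation of the previous sketch (not the disclosed implementation), the reported differences could be mapped to short recommendation strings and routed to the expert user and/or the second user. The hint wording and the routing field are assumptions:

def build_notification(diffs, recipient="first_user"):
    """Turn detected differences into a simple recommendation message;
    the message may be routed to the expert only, or to both users."""
    hints = {
        "rotation_deg": "adjust the number of turns",
        "displacement_px": "move the tool a little further",
        "force_n": "adjust the applied force",
    }
    lines = [f"{hints.get(metric, metric)} (off by {value:+.1f})"
             for metric, value in diffs.items()]
    return {"to": recipient, "text": "; ".join(lines) or "actions match"}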

At optional step 375, the processing system may obtain an instruction to engage a remote control of the at least the second force feedback glove by the first user via the at least the first force feedback glove. The instruction may be a voice command that is extracted from the first video by the processing system and recognized via a speech detection and recognition model. In one example, the model may include natural language processing (NLP), which may enable the processing system to interpret the command. However, in another example, the model may be a more basic speech detection model for detecting a particular defined command from a smaller set of defined available commands.

At optional step 380, the processing system may obtain control signals from the at least the first force feedback glove. For instance, data from sensors of the at least the first force feedback glove may be used to detect the first action at step 330, but may alternatively or additionally be used as control signals, or may be translated into control signals for controlling the at least the second force feedback glove.

At optional step 385, the processing system may transmit the control signals to the at least the second force feedback glove, where the control signals cause the at least the second force feedback glove to engage at least one actuator of the at least the second force feedback glove. In one example, the control signals may be transmitted to/via the second communication system of the second user.
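
To illustrate steps 380-385 informally, sensor readings from the first force feedback glove might be translated into actuator commands for the second glove roughly as follows. The sample fields, the gain parameter, and the suggestion of JSON transport are assumptions of this sketch, not details from the disclosure:

from dataclasses import dataclass
from typing import List

@dataclass
class GloveSample:
    finger_forces: List[float]  # per-finger force readings, newtons
    wrist_rotation: float       # degrees, e.g., from gyro/compass

def to_control_signals(sample: GloveSample, gain: float = 1.0) -> dict:
    """Translate sensor readings from the first force feedback glove into
    actuator commands for the second force feedback glove."""
    return {
        "finger_actuators": [gain * f for f in sample.finger_forces],
        "wrist_actuator": gain * sample.wrist_rotation,
    }

# The resulting dict would then be transmitted to (or via) the second
# user's communication system, e.g., serialized as JSON over the session.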

Following step 370, or any of the optional steps 375-385, the method 300 proceeds to step 395, where the method ends.

It should be noted that the method 300 may be expanded to include additional steps, or may be modified to replace steps with different steps, to combine steps, to omit steps, to perform steps in a different order, and so forth. For instance, in one example, the processing system may repeat one or more steps of the method 300, such as steps 330-370 or steps 330-385 for additional actions or tasks with respect to a workflow, steps 310-370 or steps 310-385 for additional visual communications of the same or different users with respect to the same or different tasks, etc. In still another example, the method 300 may be expanded to include task detection and then selecting object recognition models and/or action detection models in accordance with the task. For instance, the processing system may apply topic models (e.g., classifiers) for a number of tasks to the first video and/or the second video to identify a task. The topic model classifiers can be trained from any text, video, image, audio, and/or other types of content to recognize various topics, which may include objects like “car,” scenes like “outdoor,” and actions or events like “baseball.” Topic identification classifiers may include support vector machine (SVM) based or non-SVM based classifiers, such as neural network based classifiers, and may utilize the same or similar features extracted from the first video or the second video that may be used to identify objects and/or actions. Once a task is identified, the task may be further correlated with object recognition and/or action detection models. Thus, these and other modifications are all contemplated within the scope of the present disclosure.
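
As a rough, hypothetical stand-in for the SVM or neural-network topic classifiers mentioned above, task identification can be sketched with a trivial keyword scorer over session content; it only illustrates how an identified task would then select the recognition and action detection models to activate:

from typing import Dict, Optional, Sequence

def identify_task(transcript: str,
                  task_keywords: Dict[str, Sequence[str]]) -> Optional[str]:
    """Toy stand-in for a topic classifier: score each candidate task by
    keyword hits in session content and return the best-scoring task, or
    None if nothing matches. Assumes at least one candidate task."""
    text = transcript.lower()
    scores = {task: sum(word in text for word in words)
              for task, words in task_keywords.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

# Hypothetical usage: the identified task is then correlated with the
# object recognition and action detection models to activate.
task = identify_task("my router keeps dropping the wifi connection",
                     {"router_setup": ["router", "wifi"],
                      "hvac_repair": ["condenser", "thermostat"]})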

In addition, although not expressly specified above, one or more steps of the method 300 may include a storing, displaying, and/or outputting step as required for a particular application. In other words, any data, records, fields, and/or intermediate results discussed in the method can be stored, displayed, and/or outputted to another device as required for a particular application. Furthermore, operations, steps, or blocks in FIG. 3 that recite a determining operation or involve a decision do not necessarily require that both branches of the determining operation be practiced. In other words, one of the branches of the determining operation can be deemed as an optional step. Furthermore, operations, steps, or blocks of the above described method(s) can be combined, separated, and/or performed in a different order from that described above, without departing from the example embodiments of the present disclosure.

FIG. 4 depicts a high-level block diagram of a computing device or processing system specifically programmed to perform the functions described herein. For example, any one or more components or devices illustrated in FIG. 1 or described in connection with the examples of FIGS. 2 and 3 may be implemented as the processing system 400. As depicted in FIG. 4, the processing system 400 comprises one or more hardware processor elements 402 (e.g., a microprocessor, a central processing unit (CPU), and the like), a memory 404 (e.g., random access memory (RAM), read only memory (ROM), a disk drive, an optical drive, a magnetic drive, and/or a Universal Serial Bus (USB) drive), a module 405 for providing a notification indicative of a detected difference in actions relating to a first object in a first video and a second object in a second video of a visual communication session, and various input/output devices 406, e.g., a camera, a video camera, storage devices, including but not limited to, a tape drive, a floppy drive, a hard disk drive or a compact disk drive, a receiver, a transmitter, a speaker, a display, a speech synthesizer, an output port, and a user input device (such as a keyboard, a keypad, a mouse, and the like).

Although only one processor element is shown, it should be noted that the computing device may employ a plurality of processor elements. Furthermore, although only one computing device is shown in the Figure, if the method(s) as discussed above is implemented in a distributed or parallel manner for a particular illustrative example, i.e., the steps of the above method(s) or the entire method(s) are implemented across multiple or parallel computing devices, e.g., a processing system, then the computing device of this Figure is intended to represent each of those multiple general-purpose computers. Furthermore, one or more hardware processors can be utilized in supporting a virtualized or shared computing environment. The virtualized computing environment may support one or more virtual machines representing computers, servers, or other computing devices. In such virtualized virtual machines, hardware components such as hardware processors and computer-readable storage devices may be virtualized or logically represented. The hardware processor 402 can also be configured or programmed to cause other devices to perform one or more operations as discussed above. In other words, the hardware processor 402 may serve the function of a central controller directing other devices to perform the one or more operations as discussed above.

It should be noted that the present disclosure can be implemented in software and/or in a combination of software and hardware, e.g., using application specific integrated circuits (ASIC), a programmable logic array (PLA), including a field-programmable gate array (FPGA), or a state machine deployed on a hardware device, a computing device, or any other hardware equivalents, e.g., computer readable instructions pertaining to the method(s) discussed above can be used to configure a hardware processor to perform the steps, functions and/or operations of the above disclosed method(s). In one example, instructions and data for the present module or process 405 for providing a notification indicative of a detected difference in actions relating to a first object in a first video and a second object in a second video of a visual communication session (e.g., a software program comprising computer-executable instructions) can be loaded into memory 404 and executed by hardware processor element 402 to implement the steps, functions, or operations as discussed above in connection with the example method(s). Furthermore, when a hardware processor executes instructions to perform “operations,” this could include the hardware processor performing the operations directly and/or facilitating, directing, or cooperating with another hardware device or component (e.g., a co-processor and the like) to perform the operations.

The processor executing the computer readable or software instructions relating to the above described method(s) can be perceived as a programmed processor or a specialized processor. As such, the present module 405 for providing a notification indicative of a detected difference in actions relating to a first object in a first video and a second object in a second video of a visual communication session (including associated data structures) of the present disclosure can be stored on a tangible or physical (broadly non-transitory) computer-readable storage device or medium, e.g., volatile memory, non-volatile memory, ROM memory, RAM memory, magnetic or optical drive, device or diskette, and the like. Furthermore, a “tangible” computer-readable storage device or medium comprises a physical device, a hardware device, or a device that is discernible by the touch. More specifically, the computer-readable storage device may comprise any physical devices that provide the ability to store information such as data and/or instructions to be accessed by a processor or a computing device, such as a computer or an application server.

While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of a preferred embodiment should not be limited by any of the above-described example embodiments, but should be defined only in accordance with the following claims and their equivalents.

What is claimed is:
1. A method comprising: detecting, by a processing system including at least one processor, a first object in a first video of a first user; detecting, by the processing system, a second object in a second video of a second user, wherein the first video and the second video are part of a visual communication session between the first user and the second user; detecting, by the processing system, a first action in the first video relative to the first object; detecting, by the processing system, a second action in the second video relative to the second object; detecting, by the processing system, a difference between the first action and the second action; and providing, by the processing system, a notification indicative of the difference.
2. The method of claim 1, wherein the first object and the second object are a same type of object.
3. The method of claim 2, wherein the first object and the second object are a same type of body part of the first user and the second user.
4. The method of claim 2, wherein the first object and the second object are a same type of equipment.
5. The method of claim 1, wherein the detecting of the first object is via an object recognition model.
6. The method of claim 5, wherein the detecting of the first object is further in accordance with at least one of: user account information of the second user; or an indication of a task associated with the first object.
7. The method of claim 5, wherein the detecting of the second object is via the object recognition model.
8. The method of claim 7, wherein the detecting of the first object and the detecting of the second object are further in accordance with at least one of: user account information of the second user; or an indication of a task associated with the first object and the second object.
9. The method of claim 1, wherein the first user engages in the first action via at least a first force feedback glove, and wherein the second user engages in the second action via at least a second force feedback glove.
10. The method of claim 9, further comprising: obtaining control signals from the at least the first force feedback glove; and transmitting the control signals to the at least the second force feedback glove, wherein the control signals cause the at least the second force feedback glove to engage at least one actuator of the at least the second force feedback glove.
11. The method of claim 1, further comprising: detecting at least a third object in the second video; and obscuring the at least the third object in the second video.
12. The method of claim 11, wherein the at least the third object is determined to not have a relevance to the second action in the second video relative to the second object.
13. The method of claim 11, further comprising: detecting at least a fourth object in the first video; and obscuring the at least the fourth object in the first video, wherein the at least the fourth object is determined to not have a relevance to the first action in the first video relative to the first object.
14. The method of claim 1, wherein the notification comprises a visual indication relative to at least the second object via an augmented reality display of the second user.
15. The method of claim 1, wherein the first action in the first video relative to the first object comprises at least one of: a manipulation of the first object detected via at least one of the first video or at least a first force feedback glove of the first user; or a manipulation of at least a third object relative to the first object detected via at least one of the first video or the at least the first force feedback glove.
16. The method of claim 15, wherein the second action in the second video relative to the second object comprises at least one of: a manipulation of the second object detected via at least one of the second video or at least a second force feedback glove of the second user; or a manipulation of at least a fourth object relative to the second object detected via at least one of the second video or the at least the second force feedback glove.
17. The method of claim 16, wherein the detecting the difference between the first action and the second action comprises detecting a difference between the manipulation of the first object and the manipulation of the second object.
18. The method of claim 16, wherein the detecting the difference between the first action and the second action comprises detecting a difference between the manipulation of the at least the third object relative to the first object and the manipulation of the at least the fourth object relative to the second object.
19. A non-transitory computer-readable medium storing instructions which, when executed by a processing system including at least one processor, cause the processing system to perform operations, the operations comprising: detecting a first object in a first video of a first user; detecting a second object in a second video of a second user, wherein the first video and the second video are part of a visual communication session between the first user and the second user; detecting a first action in the first video relative to the first object; detecting a second action in the second video relative to the second object; detecting a difference between the first action and the second action; and providing a notification indicative of the difference.
20. A device comprising: a processing system including at least one processor; and a computer-readable medium storing instructions which, when executed by the processing system, cause the processing system to perform operations, the operations comprising: detecting a first object in a first video of a first user; detecting a second object in a second video of a second user, wherein the first video and the second video are part of a visual communication session between the first user and the second user; detecting a first action in the first video relative to the first object; detecting a second action in the second video relative to the second object; detecting a difference between the first action and the second action; and providing a notification indicative of the difference.